feat(core): RAG retrieval overhaul (RRF + IVF_PQ), redaction hardening, and test coverage by jb-thery · Pull Request #46 · jcode-works/jcode-ragmir

jb-thery · 2026-07-03T17:27:53Z

Summary

RAG retrieval, redaction, config, and test-coverage overhaul driven by the deep audit. Split into 6 logical conventional commits.

Changes

chore: dist/ is now gitignored build output (built locally with pnpm build); dropped the CI committed-dist check.
feat(query): hybrid retrieval fuses vector + BM25 with weighted Reciprocal Rank Fusion (rank-only, no score calibration needed). Vector weight 0.7 / lexical 0.3. Recall stays 1.0 on the golden set.
feat(store): trains an IVF_PQ vector index automatically once the corpus reaches 256 rows (numPartitions ≈ √rows, clamped), keeping flat scan for small corpora. Unblocks query scalability beyond brute force.
fix(redaction): Luhn verification on credit-card numbers (no more over-redacting non-card digit runs), URL username now redacted alongside the password, plus Stripe / GitLab / Bearer providers and a verify: "luhn" opt-in on patterns.
feat(core): strict config schema (typos rejected), stderr warnings on invalid env overrides, access-log retention (trims past 10 MB), bounded LRU Transformers cache + clearTransformersCache(), and CLI option parsers extracted to a testable cli-options.ts.
test: suite 132 → 151 cases / 23 files — destroy, ask(), store manifest, embeddings, ingest --rebuild, config strict, access-log rotation, evaluate miss, redaction adversarial, CLI parsers, text tokenization.

Confidentiality posture verified

security-audit on the real monorepo index (681 chunks): zeroTelemetry=true, llmGeneration=false, transformersAllowRemoteModels=false, redactionEnabled=true, storageGitIgnored=true.

Checklist

pnpm lint clean
pnpm check clean
pnpm test — 151/151
pnpm build clean
pnpm smoke green (CLI + MCP 8 tools + license-webhook + release preflight)
commitlint 0 problems on all 6 commits

Move all packages/*/dist/ directories from committed artifacts to gitignored build output. dist/ is regenerated locally with `pnpm build` before running the CLI, MCP smoke, the library-API demo, or `pnpm validate`.

- .gitignore: ignore ragmir-core/dist, ragmir-tts/dist (already ignored for app/landing/license-webhook); add *dist catch-all.
- ci.yml: drop the `git diff --exit-code -- dist` step that enforced committed dist, since dist is no longer tracked.
- AGENTS.md, CLAUDE.md, README.md, library-api-demo README: document that dist is gitignored and must be built locally; warn against `npx ragmir` for local testing (resolves the published npm package, not the working copy).

Replace the weighted-sum fusion (vector and BM25 scores divided by their max) with Reciprocal Rank Fusion, the standard hybrid-retrieval approach. Each candidate scores `weight / (RRF_K + rank)` per retriever it appears in, summed across retrievers, so the BM25 and vector score distributions never need calibration against each other.

The vector retriever is weighted higher (0.7) than the lexical one (0.3) because, with the default local-hash embeddings, vector proximity is the more discriminant signal on small corpora; the lexical weight still lets exact-keyword evidence pull in candidates the vector retriever missed.

- RRF_K = 60 (Cormack et al. 2009 constant).
- Remove the now-unused weighted-sum helpers (vectorScore, normalizeScore) and the normalizeForMatch import left dead by the refactor.

Retrieval recall stays at 1.0 on the sovereign-rag-demo golden set.
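For reviewers who want the scoring rule spelled out, here is a minimal sketch of weighted RRF as described above. Only `RRF_K = 60` and the 0.7 / 0.3 weights come from this PR; the `RankedList` type, the `fuseRrf` name, and the 1-based rank convention are assumptions made for illustration and are not taken from the actual query module.

```typescript
// Weighted Reciprocal Rank Fusion sketch. Identifiers are hypothetical.
const RRF_K = 60; // constant quoted in the commit message

interface RankedList {
  weight: number; // retriever weight (0.7 vector, 0.3 lexical in this PR)
  ids: string[];  // candidate ids, best first
}

function fuseRrf(lists: RankedList[]): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const { weight, ids } of lists) {
    ids.forEach((id, index) => {
      const rank = index + 1; // assumption: ranks start at 1
      // Each retriever contributes weight / (RRF_K + rank); contributions add up.
      scores.set(id, (scores.get(id) ?? 0) + weight / (RRF_K + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// "b" is ranked second by both retrievers, so it collects 0.7/62 + 0.3/62
// and overtakes "a", which only the vector retriever returned (0.7/61).
const fused = fuseRrf([
  { weight: 0.7, ids: ["a", "b", "c"] }, // vector retriever
  { weight: 0.3, ids: ["d", "b"] },      // BM25 retriever
]);
// fused ids, best first: b, a, c, d
```

Because only ranks enter the formula, the raw BM25 and cosine scores can be on entirely different scales without affecting the fused order.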

Above a 256-row threshold, automatically create an IVF_PQ index on the vector column after writing the table. Below the threshold, LanceDB keeps using an exact flat scan, which is optimal for small corpora and avoids wasted index-training work.

- numPartitions ≈ sqrt(rowCount), clamped to [8, 1024] (LanceDB production heuristic).
- numSubVectors = 16 (divides the 384-dim local-hash/mxbai-xsmall vectors).
- index creation is idempotent (skipped if vector_idx exists) and best-effort (a training failure on edge-case dimensionality leaves the table usable via flat scan rather than failing the ingest).

This unblocks query scalability beyond brute-force scan without changing the overwrite write path.
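The sizing heuristic above can be sketched as a pure function. The threshold, clamp bounds, and sub-vector count are the values quoted in this commit; the function name, the `IndexPlan` shape, the use of `Math.round` for "≈", and treating exactly 256 rows as indexed are assumptions. The sketch deliberately stops short of calling LanceDB, since the actual index-creation call is not shown in this PR.

```typescript
// Index sizing sketch. Names are hypothetical; values come from the commit message.
const INDEX_ROW_THRESHOLD = 256;
const MIN_PARTITIONS = 8;
const MAX_PARTITIONS = 1024;
const NUM_SUB_VECTORS = 16; // 384 / 16 = 24 dimensions per sub-vector

interface IndexPlan {
  numPartitions: number;
  numSubVectors: number;
}

// Returns null when the corpus is small enough to stay on the exact flat scan.
function planVectorIndex(rowCount: number): IndexPlan | null {
  if (rowCount < INDEX_ROW_THRESHOLD) return null; // assumption: 256 itself is indexed
  const approxSqrt = Math.round(Math.sqrt(rowCount)); // assumption: round to nearest
  const numPartitions = Math.min(MAX_PARTITIONS, Math.max(MIN_PARTITIONS, approxSqrt));
  return { numPartitions, numSubVectors: NUM_SUB_VECTORS };
}

// planVectorIndex(255)      -> null (flat scan)
// planVectorIndex(681)      -> 26 partitions (the 681-chunk monorepo index)
// planVectorIndex(2_000_000) -> 1024 partitions (upper clamp)
```

Keeping the sizing separate from the index call makes the clamp testable without a database and leaves the best-effort try/catch around training as the only side-effecting part.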

Close two confidentiality gaps and broaden provider coverage in the built-in redaction patterns:

- credit_card: add a match-then-verify Luhn check (new RedactionPattern.verify field). Numeric runs that are not valid card numbers (version numbers, account IDs, hex runs) are left untouched instead of being over-redacted.
- url_credentials: extend the pattern so both the username and the password are redacted. Previously only the password was stripped, leaking the username.
- Add Stripe secret keys (sk_live/rk_live/sk_test), GitLab tokens (glpat-), and generic Bearer tokens. Order the more specific patterns before the generic api_token so they win on overlap.
- Add an optional `verify: "luhn"` to the RedactionPattern type so custom patterns can opt into the same check.
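As an illustration of the match-then-verify idea, the sketch below first finds digit runs that look like card numbers and only redacts those that pass the Luhn checksum. The regular expression, the 13 to 19 digit length window, the function names, and the `[REDACTED:credit_card]` placeholder are all assumptions for this example; the real built-in pattern and replacement text are not shown in this PR.

```typescript
// Luhn checksum: double every second digit from the right, subtract 9 from
// any result above 9, and require the total to be a multiple of 10.
function passesLuhn(candidate: string): boolean {
  const digits = candidate.replace(/\D/g, ""); // ignore spaces and hyphens
  if (digits.length < 13 || digits.length > 19) return false; // assumed length window
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let digit = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (doubleNext) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// Hypothetical candidate pattern: 13 to 19 digits with optional single
// space or hyphen separators.
const CARD_CANDIDATE = /\b\d(?:[ -]?\d){12,18}\b/g;

// Match first, then verify: only Luhn-valid runs are replaced.
function redactCards(text: string): string {
  return text.replace(CARD_CANDIDATE, (match) =>
    passesLuhn(match) ? "[REDACTED:credit_card]" : match,
  );
}

// redactCards("card 4111 1111 1111 1111 ok") -> "card [REDACTED:credit_card] ok"
// redactCards("order 1234567890123456")      -> unchanged (fails the checksum)
```

The checksum only filters accidental matches. Roughly one in ten random digit runs of the right length still passes it, so this reduces over-redaction rather than eliminating it.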

…d use

Several additive robustness and observability improvements, plus extraction of the CLI option parsers into a testable module:

- config: make rawConfigSchema strict so unknown keys (typos) are rejected instead of silently ignored; warn on stderr when an env override (e.g. RAGMIR_TOP_K=abc) is invalid so operators notice a no-op override.
- access-log: bound the log growth with a soft cap. When the file exceeds 10 MB, trim it to the most recent 50 000 lines before the next append, so a long-lived MCP server cannot grow it without limit or OOM a usage report.
- embeddings: bound the Transformers.js pipeline cache to 3 entries with LRU eviction, and export clearTransformersCache(). destroyIndex now calls it so a re-ingest with a different embedding config does not pin stale ONNX weights.
- cli-options: extract the pure option parsers (parsePositiveInt, parseNumber, parseRecallThreshold, audioEngine, audioAllowRemoteModels, audioLanguage, parseAgentInstallScope, parseAgentInstallMode) into a dedicated module so they can be unit-tested without importing commander. cli.ts imports them. parsePositiveInt now rejects fractional input like "1.5" instead of silently truncating via parseInt.
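Two of the items above are easy to picture in code: the bounded LRU cache and the stricter integer parser. The sketch below is one plausible shape for each, not the code in this PR. The `LruCache` class and its methods are hypothetical, and while `parsePositiveInt` borrows the name from cli-options, its body and the plain `Error` it throws are assumptions.

```typescript
// Bounded LRU cache sketch. A Map iterates in insertion order, so the first
// key is always the least recently used one.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    this.entries.delete(key);
    this.entries.set(key, value); // re-insert to mark as most recently used
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value); // evict the LRU entry
    }
  }

  has(key: K): boolean {
    return this.entries.has(key);
  }

  clear(): void {
    this.entries.clear(); // the role clearTransformersCache() plays in the PR
  }

  get size(): number {
    return this.entries.size;
  }
}

// Strict positive-integer parser sketch. parseInt("1.5", 10) returns 1, so the
// whole string is validated instead of parsing a numeric prefix.
function parsePositiveInt(raw: string): number {
  const text = raw.trim();
  const value = Number(text);
  if (!/^\d+$/.test(text) || !Number.isSafeInteger(value) || value < 1) {
    throw new Error(`expected a positive integer, got "${raw}"`);
  }
  return value;
}
```

With a capacity of 3, as in the embeddings cache, loading a fourth pipeline evicts whichever of the first three was used least recently, which keeps at most three sets of ONNX weights resident.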

Close the test-coverage gaps the audit identified, raising the suite from 132 to 151 cases across 23 files:

- destroy.test.ts (new): destroyIndex removed flag and access-log entry.
- query.test.ts: ask() empty-sources and populated cited-retrieval branches.
- store.test.ts: empty-text-files manifest round-trip, removal on empty, missing, malformed, and malformed-entry filtering; writeRows zero-rows dropTable and full re-write.
- embeddings.test.ts: embedTexts([]) early return and clearTransformersCache.
- ingest.test.ts: --rebuild forces a full re-index (reusedFiles === 0).
- config.test.ts: strict() rejects unknown keys; non-object config rejected.
- access-log.test.ts: retention trims past 10 MB; disabled logging writes nothing.
- evaluate.test.ts: miss case (hit=false, bestRank=null, recall=0).
- redaction.test.ts: Luhn pass/fail, URL username redacted, Stripe/GitLab/bearer providers, obfuscation limitation documented.
- cli.test.ts (new): all cli-options parsers incl. the MP3-without-engine confidentiality guard and agent scope/mode validation.
- text.test.ts (new): tokenize/normalizeForMatch (the BM25 foundation).

github-actions · 2026-07-03T17:51:25Z

🎉 This PR is included in version 2.1.0 🎉

The release is available on:

v2.1.0
GitHub release

Your semantic-release bot 📦🚀

jb-thery added 6 commits July 4, 2026 00:26

jb-thery merged commit c6253d2 into develop Jul 3, 2026
7 checks passed

jb-thery deleted the feature/rag-retrieval-security-overhaul branch July 3, 2026 17:37

jb-thery mentioned this pull request Jul 3, 2026

release: RAG retrieval overhaul (RRF + IVF_PQ), redaction hardening, coverage #47

Merged

3 tasks

github-actions Bot added the released label Jul 3, 2026
